Home
Column
Gender Classification via Voice
Jake Whalen
CS 584 Final Project
Fall 2017
Start
Summary
Column
Choosing a Project
-
Topic? Sports, Beer, Other
-
Supervised or unsupervised learning?
-
Data Source: Download, Web Scrape, Social Media
-
Tools: Python, R, Weka, Tableau, Excel
Choice
-
Data from Kaggle
-
Audio Analysis
-
Supervised Learning
-
Classification
-
ML in Python
-
Presentation & Report in R Markdown
-
Excel for results transfer
Goals
-
Classify an audio clip subject's gender
-
Learn which features best separate gender in audio
-
Look for other potential clusters within the data
Method
Column
Exploration
-
Read the data into R/Python
-
Ran summary functions on features
-
Plotted the data
-
Looked for patterns and relationships between features
-
Determined which features separate genders best
Column
Classification
-
Used Scikit-learn in Python
-
Split the data for training/testing (2/3, 1/3)
-
Used grid search (GridSearchCV) to identify the best parameters
-
KNN (K-Nearest Neighbors)
-
Decision Tree (DT)
-
Support Vector Machine (SVM)
-
Logistic Regression (Log R)
-
Observed prediction outcomes; saw room for improvement
-
Attempted to improve on the initial results
-
KNN: Transform data with PCA
-
Decision Tree: Use multiple trees with Random Forest
-
SVM: Transform data with PCA
-
Log R: Normalized data from 0 to 1 in each feature
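The classification workflow above (2/3–1/3 split, then a cross-validated grid search) might be sketched as follows. The data here is a synthetic stand-in, not the Kaggle voice set, and the parameter grid is illustrative:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the Kaggle voice features
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 5)),
               rng.normal(1.5, 1.0, (150, 5))])
y = np.array([0] * 150 + [1] * 150)

# 2/3 train, 1/3 test split, as used in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=42)

# Cross-validated grid search over KNN hyperparameters
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11],
                "weights": ["uniform", "distance"],
                "p": [1, 2]},
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))  # held-out accuracy
```

The same pattern applies to the DT, SVM, and Log R grids by swapping the estimator and parameter grid.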
Column
Review
-
Confusion Matrix
-
Overall Accuracy Scores
-
Male Accuracy
-
Female Accuracy
-
ROC & AUC
-
Parameter Influence
-
Fit & Score Times
Overview
Column
Description
Dataset Comments
-
Database created to identify a voice as male or female, based upon acoustic properties of the voice and speech.
-
The dataset consists of 3,168 recorded voice samples, collected from male and female speakers.
-
The voice samples were pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0 Hz to 280 Hz (the human vocal range).
-
The samples are represented by 21 different features.
-
Source: Voice Gender Data
EDA
Column
Classes

Distributions

Boxplots

Heatmap

Scatter Plot
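The exploration step behind these tabs (summaries, correlations for the heatmap) can be sketched in Python with pandas. The toy frame below uses a few of the dataset's actual feature names but random values:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the 3,168-row voice dataset
rng = np.random.RandomState(1)
df = pd.DataFrame({
    "meanfun": rng.normal(0.12, 0.03, 100),  # mean fundamental frequency
    "IQR": rng.normal(0.08, 0.02, 100),      # interquantile range
    "sp.ent": rng.normal(0.90, 0.05, 100),   # spectral entropy
})

print(df.describe())   # summary statistics per feature
corr = df.corr()       # correlation matrix behind the heatmap
print(corr.round(2))
```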

KNN
Column
K-Nearest Neighbors
Summary
-
Used untransformed data
-
Better than a naive classifier (50/50 guessing)
-
Distance weights outperformed Uniform weights
-
P: Manhattan Distance produced better CV results (p=1)
-
Algorithm: auto attempts to decide the most appropriate neighbor-search algorithm based on the values passed to fit
-
Weights: distance weights points by the inverse of their distance, so closer neighbors of a query point have a greater influence than neighbors farther away
Best Parameters
-
algorithm = auto, n_neighbors = 11, p = 1, weights = distance
Decision Tree
Column
Decision Tree
Summary
-
Used untransformed data
-
MeanFun, sp.ent & IQR account for over 90% of feature importance
-
Presort: presort the data to speed up the finding of best splits in fitting
-
Splitter: The strategy used to choose the split at each node
-
Better at identifying males
-
Easiest model to interpret (follow the branches)
-
Tree
Best Parameters
-
criterion = gini, max_depth = 21, presort = True, splitter = random
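The best-parameter tree can be instantiated as below, again on synthetic stand-in data. Note that `presort` was removed from scikit-learn in 1.0, so it is omitted here; the feature importances the summary cites come from `feature_importances_`:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (the real top features are meanfun, sp.ent, IQR)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 5)),
               rng.normal(1.5, 1.0, (150, 5))])
y = np.array([0] * 150 + [1] * 150)

# Reported best parameters; presort is gone from modern scikit-learn
tree = DecisionTreeClassifier(criterion="gini", max_depth=21,
                              splitter="random", random_state=0)
tree.fit(X, y)
print(tree.feature_importances_)  # normalized to sum to 1
```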
SVM
Column
Support Vector Machine
Summary
-
Modified penalty parameter to achieve better results
-
Higher penalties achieved better scores
-
Better at classifying males
Best Parameters
Log Reg
Column
Logistic Regression
Summary
-
Untransformed data
-
Best Male accuracy
-
Outperformed Log Reg (Normal)
-
C: Inverse of regularization strength
-
fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
-
penalty: Used to specify the norm used in the penalization
Best Parameters
-
C = 0.7, fit_intercept = True, penalty = l1
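A sketch of the best logistic regression on synthetic stand-in data; current scikit-learn requires the liblinear solver for the l1 penalty, which was the default at the time:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 5)),
               rng.normal(1.5, 1.0, (150, 5))])
y = np.array([0] * 150 + [1] * 150)

# Reported best parameters; liblinear supports the l1 penalty
logr = LogisticRegression(C=0.7, fit_intercept=True, penalty="l1",
                          solver="liblinear")
logr.fit(X, y)
print(logr.score(X, y))  # training accuracy
```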
Random Forest
Column
Random Forest
Summary
-
Best Female accuracy
-
Took longer to fit than the Decision Tree
Best Parameters
-
criterion = entropy, max_depth = 9, n_estimators = 15
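With the reported best parameters, the forest is a small ensemble of 15 trees; a sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 5)),
               rng.normal(1.5, 1.0, (150, 5))])
y = np.array([0] * 150 + [1] * 150)

# Reported best parameters: 15 entropy-split trees, depth capped at 9
forest = RandomForestClassifier(criterion="entropy", max_depth=9,
                                n_estimators=15, random_state=0)
forest.fit(X, y)
print(len(forest.estimators_))  # 15 fitted trees
```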
KNN (PCA)
Column
K-Nearest Neighbors (PCA)
Summary
-
Best overall accuracy
-
9 PCA components used
-
The fewer the neighbors, the better
Best Parameters
-
algorithm = auto, n_neighbors = 3, p = 1, weights = distance
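The PCA-then-KNN combination above is naturally expressed as a scikit-learn pipeline; the 12-feature synthetic data below is a stand-in for the dataset's 21 features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 12)),
               rng.normal(1.0, 1.0, (150, 12))])
y = np.array([0] * 150 + [1] * 150)

# Reduce to 9 components, then KNN with the reported best parameters
model = make_pipeline(
    PCA(n_components=9),
    KNeighborsClassifier(algorithm="auto", n_neighbors=3, p=1,
                         weights="distance"),
)
model.fit(X, y)
```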
SVM (PCA)
Column
Support Vector Machine (PCA)
Summary
-
Improvement over SVM on untransformed data
-
Adjusted penalty parameter C of the error term
-
Achieved best performance at much lower penalty parameter levels
Best Parameters
Log Reg (Normal)
Column
Logistic Regression (Normalized)
Summary
-
Performed worse than Log Regression on untransformed data
-
Decrease in performance due to decrease in Male accuracy
-
Slight improvement in Female accuracy compared to first Log R
Best Parameters
-
C = 0.9, fit_intercept = True, penalty = l1
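Normalizing each feature to [0, 1] before fitting maps naturally onto a MinMaxScaler pipeline; a sketch on synthetic stand-in data, with the solver choice an assumption as before:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 5)),
               rng.normal(1.5, 1.0, (150, 5))])
y = np.array([0] * 150 + [1] * 150)

# Scale each feature to [0, 1], then fit with the reported best parameters
model = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(C=0.9, fit_intercept=True, penalty="l1",
                       solver="liblinear"),
)
model.fit(X, y)
X_scaled = model.named_steps["minmaxscaler"].transform(X)
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0 on the training data
```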
Conclusions
Criteria
Accuracy
-
KNN (PCA)
-
Random Forest
-
Log Regression
Male Accuracy
-
Log Regression
-
Log Regression (Normal)
-
KNN (PCA)
Female Accuracy
-
Random Forest
-
KNN (PCA)
-
Log Regression (Normal)
AUC
-
Random Forest
-
Log Regression
-
Log Regression (Normal)
ROC
Area Under the Curve
-
KNN: 0.8899249
-
Decision Tree: 0.9606488
-
SVM: 0.9611217
-
Log Reg: 0.9961107
-
KNN (PCA): 0.9921023
-
Random Forest: 0.9979454
-
SVM (PCA): 0.9930792
-
Log Reg (Normal): 0.9955755
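Each AUC above comes from scoring the test set with predicted class-1 probabilities; a sketch of the computation on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 5)),
               rng.normal(1.5, 1.0, (150, 5))])
y = np.array([0] * 150 + [1] * 150)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of class 1
auc = roc_auc_score(y_test, scores)         # area under the ROC curve
print(auc)
```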
Conclusion
Best Model
-
Best Model: Random Forest
-
2nd highest overall accuracy
-
1st Female accuracy
-
Highest Area Under the Curve
-
Decent Fitting Time
-
Faster Scoring Time
Improvements
-
Focus on a single method
-
Combine features to create new ones
-
Implement more advanced methods (Bagging/Boosting)
-
Extract features from raw audio files